audio representation
Revisiting Audio-language Pretraining for Learning General-purpose Audio Representation
Tseng, Wei-Cheng, Zhou, Xuanru, Huo, Mingyue, Shao, Yiwen, Zhang, Hao, Yu, Dong
Audio-language pretraining holds promise for general-purpose audio understanding, yet remains underexplored compared to its vision counterpart. While vision-language models like CLIP serve as widely adopted foundations, existing audio-language models primarily excel at retrieval tasks with limited adoption as general-purpose encoders. We identify three key barriers: limited large-scale audio-text corpora, insufficient caption diversity, and lack of systematic exploration and evaluation. To this end, we introduce CaptionStew, a 10.7M caption dataset aggregating diverse open-source audio-text corpora across multiple domains and captioning styles. Using this resource, we conduct the first comprehensive evaluation comparing contrastive and captioning objectives for audio representation learning across speech, music, and environmental sound tasks. Our results demonstrate that audio-language pretraining yields competitive, transferable representations. Through systematic data-scaling experiments, we reveal complementary objective strengths: contrastive learning achieves superior data efficiency at smaller scales, while captioning demonstrates better scalability on language-involved audio understanding tasks. We also find that common supervised initialization practices provide diminishing returns at scale, challenging current approaches. These findings establish audio-language pretraining as a viable pathway toward general-purpose audio representations, guiding future research. To accelerate progress, we release data preparation recipes, training protocols, and pretrained models, paving the way toward universal audio understanding. Early advances relied on supervised learning, where models trained on labeled corpora were adapted to related downstream tasks or transferred across domains (Kong et al., 2020; Chen et al., 2022a; Snyder et al., 2018; Desplanques et al., 2020).
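The paper's central comparison is between contrastive and captioning objectives. As a rough illustration of the former, below is a minimal sketch (not the authors' released code) of a CLIP/CLAP-style symmetric contrastive loss over paired audio and caption embeddings; the embedding dimension and temperature are illustrative assumptions.

```python
# Minimal sketch of a CLIP/CLAP-style symmetric contrastive objective over
# paired audio and text embeddings. Dimensions and temperature are assumptions.
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """audio_emb, text_emb: (batch, dim) outputs of the two encoders."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature       # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric InfoNCE: audio-to-text and text-to-audio cross-entropy.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy usage with random embeddings standing in for encoder outputs.
loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```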
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
- Europe > Netherlands (0.04)
- Asia > Middle East > Lebanon > Keserwan-Jbeil Governorate > Blat (0.04)
- Asia > Middle East > Jordan (0.04)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.94)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
Better audio representations are more brain-like: linking model-brain alignment with performance in downstream auditory tasks
Pepino, Leonardo, Riera, Pablo, Kamienkowski, Juan, Ferrer, Luciana
Artificial neural networks (ANNs) are increasingly powerful models of brain computation, yet it remains unclear whether improving their task performance also makes their internal representations more similar to brain signals. To address this question in the auditory domain, we quantified the alignment between the internal representations of 36 different audio models and brain activity from two independent fMRI datasets. Using voxel-wise and component-wise regression, and representation similarity analysis (RSA), we found that recent self-supervised audio models with strong performance in diverse downstream tasks are better predictors of auditory cortex activity than older and more specialized models. To assess the quality of the audio representations, we evaluated these models in 6 auditory tasks from the HEAREval benchmark, spanning music, speech, and environmental sounds. This revealed strong positive Pearson correlations ($r>0.7$) between a model's overall task performance and its alignment with brain representations. Finally, we analyzed the evolution of the similarity between audio and brain representations during the pretraining of EnCodecMAE. We discovered that brain similarity increases progressively and emerges early during pretraining, despite the model not being explicitly optimized for this objective. This suggests that brain-like representations can be an emergent byproduct of learning to reconstruct missing information from naturalistic audio data.
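The abstract names representation similarity analysis (RSA) as one of its alignment measures. A minimal sketch of the generic RSA recipe follows, assuming correlation-distance RDMs and Spearman correlation; these specific choices are assumptions, not necessarily the paper's.

```python
# Generic RSA sketch: correlate a model RDM with a brain RDM over the same
# stimulus set. The random features below stand in for real activations.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(features):
    """features: (n_stimuli, n_dims) -> condensed correlation-distance RDM."""
    return pdist(features, metric="correlation")

rng = np.random.default_rng(0)
model_feats = rng.normal(size=(100, 768))   # e.g. pooled model activations per stimulus
brain_feats = rng.normal(size=(100, 2000))  # e.g. voxel responses per stimulus

rho, _ = spearmanr(rdm(model_feats), rdm(brain_feats))
print(f"model-brain RDM similarity (Spearman rho): {rho:.3f}")
```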
- South America > Argentina > Pampas > Buenos Aires F.D. > Buenos Aires (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Asia (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.68)
- Information Technology > Artificial Intelligence > Speech (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
GRAM: Spatial general-purpose audio representation models for real-world applications
Yuksel, Goksenin, van Gerven, Marcel, van der Heijden, Kiki
Although audio foundation models have seen great progress on a wide variety of tasks, their application in real-world acoustic environments with reverberation and noise has been less successful. Moreover, as audio foundation models are typically trained on dry, single-channel audio clips, the inherent spatial nature of real-world sound scenes is overlooked and tasks involving sound localization are ruled out. To address these limitations, we propose GRAM: a General-purpose Real-world Audio Model utilizing a multi-channel masked auto-encoder approach to efficiently learn spatial audio representations from high-quality simulated real-world scenes. To evaluate the performance of GRAM and other audio foundation models in real-world sound scenes, we release Nat-HEAR: a naturalistic version of the HEAR benchmark suite comprising a simulated real-world version as well as two new sound localization tasks. We show that the performance of GRAM surpasses all state-of-the-art self-supervised audio foundation models and speech models on both HEAR and Nat-HEAR, while using only a fraction of the training data. GRAM also showcases state-of-the-art localization performance, surpassing even supervised sound localization approaches, and can be flexibly applied to either a two-channel, binaural sound format or a four-channel, Ambisonics format. Validating GRAM's performance on real-world sound recordings demonstrates robust transfer to real-world scenes. Taken together, GRAM presents a significant advancement towards robust, spatial audio foundation models for real-world applications.
- North America > United States (0.14)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- Asia > China > Beijing > Beijing (0.04)
Unify Variables in Neural Scaling Laws for General Audio Representations via Embedding Effective Rank
Deng, Xuyao, Sun, Yanjie, Dou, Yong, Xu, Kele
Scaling laws have profoundly shaped our understanding of model performance in computer vision and natural language processing, yet their application to general audio representation learning remains underexplored. A key challenge lies in the multifactorial nature of general audio representation: representation quality is jointly influenced by variables such as audio length, embedding dimensionality, model depth, model architecture, data volume, etc., many of which are difficult to isolate or express analytically. In this work, we present a systematic study of scaling laws for general audio representations by utilizing embedding effective rank (RankMe) as a unifying metric that encapsulates the impact of diverse variables on representation quality. RankMe enables a label-free, information-theoretic quantification of audio embeddings, allowing us to examine scaling behaviors across a wide hyper-parameter space, including model size, training data volume, computational budget, architectural configurations, etc. Our empirical findings reveal a consistent power-law relationship between RankMe and representation quality, suggesting that embedding effective rank serves as a reliable proxy for assessing and predicting model performance in audio representation learning. This work not only validates the applicability of classical scaling principles to the general audio domain but also offers a theoretically grounded and empirically robust framework for guiding future model scaling strategies in audio foundation models.
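RankMe itself has a simple closed form: the exponential of the entropy of the normalized singular-value distribution of the embedding matrix. A short sketch follows, with random embeddings standing in for real model outputs.

```python
# Sketch of the RankMe effective-rank metric used as the unifying
# representation-quality variable; the embeddings below are placeholders.
import numpy as np

def rankme(embeddings, eps=1e-7):
    """embeddings: (n_samples, dim). Returns exp of the entropy of the
    normalized singular-value distribution (the 'effective rank')."""
    s = np.linalg.svd(embeddings, compute_uv=False)
    p = s / (s.sum() + eps) + eps
    return float(np.exp(-(p * np.log(p)).sum()))

Z = np.random.default_rng(0).normal(size=(4096, 512))
print(f"RankMe ~ {rankme(Z):.1f} (upper bound = min(n, dim) = 512)")
```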
- Asia > China > Beijing > Beijing (0.05)
- Asia > China > Hunan Province > Changsha (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Representation-Based Data Quality Audits for Audio
Gonzalez-Jimenez, Alvaro, Gröger, Fabian, Wermelinger, Linda, Bürli, Andrin, Kastanis, Iason, Lionetti, Simone, Pouly, Marc
Data quality issues such as off-topic samples, near duplicates, and label errors often limit the performance of audio-based systems. This work leverages self-supervised audio representations to identify common data quality issues, creating ranked review lists that surface distinct issues within a single, unified process. The method is benchmarked on ESC-50, GTZAN, and a proprietary industrial dataset, using both synthetic and naturally occurring corruptions. The results demonstrate that this framework achieves state-of-the-art ranking performance, often outperforming issue-specific baselines and enabling significant annotation savings by efficiently guiding human review. Index Terms: data quality, dataset auditing, representation learning, near-duplicate detection, label errors. High-stakes audio applications, from predictive maintenance and safety monitoring to large-scale media search, depend on data that is abundant and trustworthy [1, 2, 3].
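As a rough illustration of how representation-based audits can produce ranked review lists, here is a hedged sketch that scores near-duplicate candidates by nearest-neighbor similarity and label-error candidates by neighbor-label disagreement; the scoring rules and helper names are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch: rank samples for human review from self-supervised embeddings.
import numpy as np

def audit_scores(emb, labels, k=5):
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)      # cosine geometry
    sim = emb @ emb.T
    np.fill_diagonal(sim, -np.inf)
    nn_idx = np.argsort(-sim, axis=1)[:, :k]                    # k nearest neighbors
    dup_score = sim[np.arange(len(emb)), nn_idx[:, 0]]          # high -> likely near-duplicate
    neighbor_labels = labels[nn_idx]
    disagreement = (neighbor_labels != labels[:, None]).mean(axis=1)  # high -> possible label error
    return dup_score, disagreement

rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 128))            # e.g. self-supervised audio embeddings
labels = rng.integers(0, 10, size=200)
dup, err = audit_scores(emb, labels)
review_order = np.argsort(-err)              # ranked review list for annotators
```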
An overview of neural architectures for self-supervised audio representation learning from masked spectrograms
Yadav, Sarthak, Theodoridis, Sergios, Tan, Zheng-Hua
In recent years, self-supervised learning has amassed significant interest for training deep neural representations without labeled data. One such self-supervised learning approach is masked spectrogram modeling, where the objective is to learn semantically rich contextual representations by predicting removed or hidden portions of the input audio spectrogram. With the Transformer neural architecture at its core, masked spectrogram modeling has emerged as the prominent approach for learning general purpose audio representations, a.k.a. audio foundation models. Meanwhile, addressing the issues of the Transformer architecture, in particular the underlying Scaled Dot-product Attention operation, which scales quadratically with input sequence length, has led to renewed interest in recurrent sequence modeling approaches. Among them, Selective structured state space models (such as Mamba) and extended Long Short-Term Memory (xLSTM) are the two most promising approaches which have experienced widespread adoption. While the body of work on these two topics continues to grow, there is currently a lack of an adequate overview encompassing the intersection of these topics. In this paper, we present a comprehensive overview of the aforementioned research domains, covering masked spectrogram modeling and the previously mentioned neural sequence modeling architectures, Mamba and xLSTM. Further, we compare Transformers, Mamba and xLSTM based masked spectrogram models in a unified, reproducible framework on ten diverse downstream audio classification tasks, which will help interested readers to make informed decisions regarding suitability of the evaluated approaches to adjacent applications.
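For readers unfamiliar with the objective itself, a minimal sketch of masked spectrogram modeling follows: patchify a log-mel spectrogram, hide a random subset of patches, and reconstruct only the hidden ones. The shapes, mask ratio, and the simplification of zeroing out masked patches (real MAE-style models typically encode only the visible patches) are assumptions for illustration.

```python
# Minimal masked spectrogram modeling sketch; sizes are illustrative.
import torch
import torch.nn as nn

def patchify(spec, patch=16):
    # spec: (batch, n_mels, frames), both divisible by `patch`
    b, m, t = spec.shape
    x = spec.reshape(b, m // patch, patch, t // patch, patch)
    return x.permute(0, 1, 3, 2, 4).reshape(b, -1, patch * patch)  # (b, n_patches, patch*patch)

spec = torch.randn(4, 128, 256)               # batch of log-mel spectrograms
patches = patchify(spec)                      # (4, 128, 256): 128 patches of 256 values
mask = torch.rand(patches.shape[:2]) < 0.8    # hide 80% of the patches

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True), num_layers=2)
decoder = nn.Linear(256, 256)

visible = patches.clone()
visible[mask] = 0.0                           # simplified: zero out masked patches
pred = decoder(encoder(visible))
loss = ((pred - patches) ** 2)[mask].mean()   # reconstruct only the masked patches
```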
- Asia > China > Beijing > Beijing (0.04)
- Europe > Switzerland (0.04)
- Europe > Denmark > North Jutland > Aalborg (0.04)
- Research Report (1.00)
- Overview (1.00)
Neural Proxies for Sound Synthesizers: Learning Perceptually Informed Preset Representations
Combes, Paolo, Weinzierl, Stefan, Obermayer, Klaus
Deep learning appears as an appealing solution for Automatic Synthesizer Programming (ASP), which aims to assist musicians and sound designers in programming sound synthesizers. However, integrating software synthesizers into training pipelines is challenging due to their potential non-differentiability. This work tackles this challenge by introducing a method to approximate arbitrary synthesizers. Specifically, we train a neural network to map synthesizer presets onto an audio embedding space derived from a pretrained model. This facilitates the definition of a neural proxy that produces compact yet effective representations, thereby enabling the integration of audio embedding loss into neural-based ASP systems for black-box synthesizers. We evaluate the representations derived by various pretrained audio models in the context of neural-based nASP and assess the effectiveness of several neural network architectures, including feedforward, recurrent, and transformer-based models, in defining neural proxies. We evaluate the proposed method using both synthetic and hand-crafted presets from three popular software synthesizers and assess its performance in a synthesizer sound matching downstream task. While the benefits of the learned representation are nuanced by resource requirements, encouraging results were obtained for all synthesizers, paving the way for future research into the application of synthesizer proxies for neural-based ASP systems.
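A hedged sketch of the neural-proxy idea: a small network maps preset parameters to the embedding a frozen pretrained audio model would assign to the rendered sound, so an embedding loss becomes differentiable with respect to presets. The dimensions and the random stand-in for frozen embeddings are assumptions.

```python
# Hedged sketch of a neural proxy for a non-differentiable synthesizer.
import torch
import torch.nn as nn

proxy = nn.Sequential(
    nn.Linear(64, 512), nn.ReLU(),     # 64 preset parameters (illustrative)
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 768),               # target: 768-d pretrained audio embedding
)

presets = torch.rand(32, 64)                        # normalized preset parameters
with torch.no_grad():
    target_emb = torch.randn(32, 768)               # stand-in for frozen audio-model embeddings
loss = 1 - nn.functional.cosine_similarity(proxy(presets), target_emb).mean()
loss.backward()                                     # gradients reach the proxy, not the synth
```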
- Europe > Austria > Vienna (0.14)
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
USAD: Universal Speech and Audio Representation via Distillation
Chang, Heng-Jui, Bhati, Saurabhchand, Glass, James, Liu, Alexander H.
Self-supervised learning (SSL) has revolutionized audio representations, yet models often remain domain-specific, focusing on either speech or non-speech tasks. USAD employs efficient layer-to-layer distillation from domain-specific SSL models to train a student on a comprehensive audio dataset. USAD offers competitive performance across various benchmarks and datasets, including frame- and instance-level speech processing tasks, audio tagging, and sound classification, achieving near state-of-the-art results with a single encoder on the SUPERB and HEAR benchmarks. In recent years, self-supervised learning (SSL) methods (learning frameworks that utilize unlabeled data without explicit supervision) have significantly advanced representation learning for audio processing. Speech SSL models like wav2vec 2.0 [1], HuBERT [2], and WavLM [3] have become the foundation of many applications such as automatic speech recognition (ASR), speaker identification, and phoneme classification. In parallel, SSL approaches developed for audio event classification and music understanding, such as SSAST [4], BEATs [5], and MERT [6], have been shown to be effective in non-speech tasks. In practice, the use of audio representations has extended beyond simple downstream tasks.
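A hedged sketch of layer-to-layer distillation in the spirit described above: selected student layers are linearly projected and regressed onto matched teacher layers with an L1 plus cosine objective. The layer pairing, dimensions, and loss weighting are assumptions rather than USAD's exact recipe.

```python
# Hedged sketch of layer-to-layer distillation from a frozen teacher.
import torch
import torch.nn as nn
import torch.nn.functional as F

def layer_distill_loss(student_feats, teacher_feats, proj):
    # student_feats / teacher_feats: lists of (batch, frames, dim) activations
    loss = 0.0
    for s, t, p in zip(student_feats, teacher_feats, proj):
        s = p(s)                                    # project student dim -> teacher dim
        loss = loss + F.l1_loss(s, t) + (1 - F.cosine_similarity(s, t, dim=-1).mean())
    return loss / len(proj)

proj = nn.ModuleList([nn.Linear(512, 768) for _ in range(3)])
student = [torch.randn(2, 100, 512) for _ in range(3)]      # 3 matched student layers
teacher = [torch.randn(2, 100, 768) for _ in range(3)]      # 3 frozen teacher layers
loss = layer_distill_loss(student, teacher, proj)
```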
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
- Asia > China > Beijing > Beijing (0.04)
Appendix A Additional downstream evaluation tasks
We evaluated all models on three additional tasks, beyond those presented in the main paper. This is done by training a fully-connected head applied to each frame's representation. The occurrence of a state change is then predicted by training a binary linear classifier, using the concatenated representations as input. Discussion of Results. The results on the additional downstream tasks are shown in Tab. For this experiment, we first categorize the activities based on the nature of the transition: (1) T1: irreversible interactions, where the backward transition is highly unlikely (e.g., cutting vegetables); (2) T2: reversible interactions, where the backward transition occurs often (e.g., opening/closing a fridge); (3) T3: interactions with no transition direction (e.g., stirring). As expected, RepLAI learns better associations between the audio and visual state changes than AVID.
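A hedged sketch of the state-change probing setup described above, on toy data: representations from before and after a clip are concatenated and fed to a binary linear classifier. The feature dimensions and the logistic-regression probe are assumptions for illustration.

```python
# Hedged sketch: linear probe for state-change occurrence on concatenated features.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
pre, post = rng.normal(size=(500, 256)), rng.normal(size=(500, 256))
X = np.concatenate([pre, post], axis=1)        # concatenated pre/post representations
y = rng.integers(0, 2, size=500)               # 1 = state change occurred
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", clf.score(X, y))
```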
AudioMAE++: learning better masked audio representations with SwiGLU FFNs
Yadav, Sarthak, Theodoridis, Sergios, Tan, Zheng-Hua
Masked Autoencoders (MAEs) trained on audio spectrogram patches have emerged as a prominent approach for learning self-supervised audio representations. While several recent papers have evaluated key aspects of training MAEs on audio data, the majority of these approaches still leverage vanilla transformer building blocks, whereas the transformer community has seen steady integration of newer architectural advancements. In this work, we propose AudioMAE++, a revamped audio masked autoencoder with two such enhancements, namely macaron-style transformer blocks with gated linear units. When pretrained on the AudioSet dataset, the proposed AudioMAE++ models outperform existing MAE-based approaches on 10 diverse downstream tasks, demonstrating excellent performance on audio classification and speech-based benchmarks. The proposed AudioMAE++ models also demonstrate excellent scaling characteristics, outperforming directly comparable standard MAE baselines with up to 4x more parameters.
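A hedged sketch of the two architectural ingredients named in the title and abstract: a SwiGLU feed-forward network and a macaron-style block that places half-weighted FFNs on either side of self-attention. Layer sizes and normalization placement are assumptions, not AudioMAE++'s exact configuration.

```python
# Hedged sketch of a SwiGLU FFN and a macaron-style transformer block.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, dim, hidden):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden)
        self.w_up = nn.Linear(dim, hidden)
        self.w_down = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))  # SiLU-gated linear unit

class MacaronBlock(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.ffn1, self.ffn2 = SwiGLU(dim, 4 * dim), SwiGLU(dim, 4 * dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.n1, self.n2, self.n3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):
        x = x + 0.5 * self.ffn1(self.n1(x))                  # first half-weighted FFN
        a, _ = self.attn(self.n2(x), self.n2(x), self.n2(x))
        x = x + a                                            # self-attention
        return x + 0.5 * self.ffn2(self.n3(x))               # second half-weighted FFN

tokens = torch.randn(2, 196, 256)                            # e.g. spectrogram patch tokens
out = MacaronBlock()(tokens)
```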
- Europe > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.40)
- Asia > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.40)
- North America > United States > Colorado (0.04)